Abstract

Recently, many regularized procedures have been proposed for variable selection in linear regression, but their performance depends on the tuning parameter selection. Here a criterion for the tuning parameter selection is proposed, which combines the strength of both stability selection and cross-validation and is therefore referred to as the prediction and stability selection (PASS). The selection consistency is established assuming the data generating model is a subset of the full model, and the small sample performance is demonstrated through simulation studies in which the assumption either holds or is violated.
1 Introduction

Many regularized procedures produce sparse solutions and are therefore sometimes used for variable selection in linear regression. Breiman (1996) showed that regularized procedures are more stable than subset selection. Such procedures include LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and adaptive LASSO (Zou, 2006). However, their performance depends crucially on the tuning parameter selection.

This manuscript is not intended to add a new regularized procedure to the long list. Rather, it aims to propose a new method for selecting an "appropriate" tuning parameter, which is crucial in any existing regularized procedure. The meaning of appropriateness depends on whether the purpose of regularization is prediction or variable selection.

For prediction, popular methods for the tuning parameter selection include $C_p$ (Mallows, 1973), cross-validation (Stone, 1974), and generalized cross-validation (Craven and Wahba, 1979). However, for prediction, it is too simplistic to consider only one regularization procedure based on one selected tuning parameter, and it is usually more powerful to consider more complicated procedures such as boosting and averaging (Hastie et al., 2009). Therefore, this manuscript focuses on the tuning parameter selection for variable selection.

For variable selection, the most popular method for the tuning parameter selection is BIC (Schwarz, 1978). The selection consistency of BIC for SCAD was shown in several papers (e.g., Wang et al., 2007; Wang et al., 2009; Zhang et al., 2010). Here selection consistency means that the probability of selecting the data generating model tends to one as the sample size goes to infinity, assuming that the data generating model is a subset of the full model. This manuscript proposes an alternative to BIC. The new method is selection consistent for a large group of regularized procedures.

Simply put, the new method combines the strength of both stability selection and cross-validation, and it is therefore referred to as the prediction and stability selection (PASS). Here stability selection is a recent idea for variable selection. Bach (2008) proposed Bolasso to enhance the original LASSO through the bootstrap, but it requires knowing the exact root-$n$ regularization decay. Meinshausen and Bühlmann (2010) proposed their version of stability selection, in which a super tuning parameter, the cutoff (pre-set as 0.8 there), needs to be selected. Most recently, Sun et al. (2012) proposed Kappa selection; however, it also has a super tuning parameter, the threshold $\alpha_n$ (pre-set as 0.1 there), that needs to be selected.

This manuscript is a note on Sun et al. (2012), aimed at avoiding the selection of the threshold $\alpha_n$ by incorporating the strength of cross-validation. The remainder of the manuscript is organized as follows. Section 2 reviews some asymptotic results for some regularized procedures. Section 3 develops a new criterion for tuning parameter selection and Section 4 examines its selection consistency. Numerical results are in Section 5 and some discussion is in Section 6.

2 Regularized procedures

Consider variable selection in linear regression,

$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$

where $\beta = (\beta_1, \dots, \beta_p)'$, $E(\varepsilon_i) = 0$, and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Assume both the response and the covariates are centered, so that no intercept is included. Let $\mathcal{A} = \{j : \beta_j \neq 0\}$ and assume $\beta$ is sparse in the sense that $|\mathcal{A}| = q < p$. Without loss of generality, assume $\mathcal{A} = \{1, \dots, q\}$.

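A general framework for the regularized regression is

$$\hat\beta_\lambda = \arg\min_{\gamma} \sum_{i=1}^{n} (y_i - x_i'\gamma)^2/n + \sum_{j=1}^{p} p_\lambda(|\gamma_j|), \qquad (2)$$

where $p_\lambda(\cdot)$ is a regularization term encouraging sparsity in $\beta$. In LASSO, $p_\lambda(|\beta_j|) = \lambda|\beta_j|$. In SCAD, $p_\lambda'(\theta) = \lambda\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda)\}$ for some $a > 2$. And in adaptive LASSO, $p_\lambda(|\beta_j|) = \lambda|\beta_j|/|\tilde\beta_j|$, where $\tilde\beta_j$ is some initial estimate of $\beta_j$.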
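To make the three regularization terms concrete, the following is a minimal Python sketch rather than code from the paper. The function names are hypothetical, the SCAD derivative is written in the form given by Fan and Li (2001), and the default constant `a = 3.7` is the value suggested there, used here as an assumption.

```python
import numpy as np

def lasso_penalty(beta, lam):
    """LASSO penalty lam * |beta_j|, evaluated elementwise."""
    return lam * np.abs(beta)

def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative p'_lam(theta) for theta >= 0 (form of Fan and Li, 2001).

    a = 3.7 is an assumed default; any a > 2 is admissible.
    """
    theta = np.asarray(theta, dtype=float)
    # Beyond lam the derivative decays linearly and is zero past a * lam.
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def adaptive_lasso_penalty(beta, lam, beta_init):
    """Adaptive LASSO penalty lam * |beta_j| / |beta_init_j|, elementwise."""
    return lam * np.abs(beta) / np.abs(beta_init)
```

The SCAD derivative equals the LASSO derivative for small coefficients and vanishes for large ones, which is why large effects are not shrunk.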
If $\hat{\mathcal{A}}_\lambda = \{j : \hat\beta_{\lambda j} \neq 0\}$ is used to estimate $\mathcal{A}$, all three aforementioned regularization procedures have been shown to be selection consistent under various conditions with an appropriately chosen $\lambda = \lambda_n$, where the subscript $n$ emphasizes the dependence on the sample size $n$.

For simplicity, this manuscript considers the case where $p$ is fixed. It has been shown that for all three regularization procedures, there exist $r_n$ and $s_n$ such that the procedures are selection consistent if $r_n \prec \lambda_n \prec s_n$, where $a_n \prec b_n$ means $a_n = o(b_n)$. This fact might also hold for many other regularization procedures. Specifically, for LASSO under the irrepresentable condition, $r_n \asymp 1/\sqrt{n}$ and $s_n \asymp 1$ (Zhao and Yu, 2006), where $a_n \asymp b_n$ means $a_n = O(b_n)$ and $b_n = O(a_n)$. In addition, $r_n \asymp 1/\sqrt{n}$ and $s_n \asymp 1$ for SCAD (Fan and Li, 2001), and $r_n \asymp 1/n$ and $s_n \asymp 1/\sqrt{n}$ for adaptive LASSO (Zou, 2006). In the following, five mutually exclusive cases of $\lambda_n$ are considered. For LASSO, refer to Bach (2008), while for the other two, refer to Sun et al. (2012).

Case 1: If $\lambda_n \succ s_n$, then $\hat\beta_{\lambda_n} = 0$ with probability tending to one.

Case 2: If $\lambda_n \asymp s_n$, then $\hat\beta_{\lambda_n} \to \gamma_0 \neq \beta$, where $\gamma_0$ is fixed and its sign pattern may or may not be the same as that of $\beta$.

Case 3: If $r_n \prec \lambda_n \prec s_n$, then $\hat\beta_{\lambda_n} \to \beta$ and the sign pattern of $\hat\beta_{\lambda_n}$ is consistent with that of $\beta$ with probability tending to one. Here the irrepresentable condition is needed for LASSO but not for the other two.

Case 4: If $\lambda_n \asymp r_n$, then the sign pattern of $\hat\beta_{\lambda_n}$ is consistent with that of $\beta$ on $\mathcal{A}$ with probability tending to one, while for every sign pattern consistent with that of $\beta$ on $\mathcal{A}$, the probability of obtaining this pattern tends to a limit in $(0, 1)$.

Case 5: If $\lambda_n \prec r_n$, then $\hat\beta_{\lambda_n} \to \beta$ and $\hat{\mathcal{A}}_{\lambda_n} = \{1, \dots, p\}$ with probability tending to one.
3 Prediction and stability selection (PASS)

A good criterion should aim to select $\lambda_n$ from Case 3; selecting $\lambda_n$ from Cases 1 or 2 might lead to under-fitting, while selecting it from Cases 4 or 5 might lead to over-fitting. With the two degenerate cases (1 and 5) pre-excluded, the criterion designed in this section incorporates cross-validation, which avoids under-fitting, and the Kappa selection proposed in Sun et al. (2012), which avoids over-fitting.

To describe this criterion, consider any aforementioned regularized procedure with tuning parameter $\lambda$. First, randomly partition the dataset $\{(y_1, x_1), \dots, (y_n, x_n)\}$ into two halves, $Z_1 = \{(y_1^*, x_1^*), \dots, (y_m^*, x_m^*)\}$ and $Z_2 = \{(y_{m+1}^*, x_{m+1}^*), \dots, (y_n^*, x_n^*)\}$, where $m = \lfloor n/2 \rfloor$. Based on $Z_1$ and $Z_2$ respectively, $\hat\beta_{k\lambda}$ is obtained via (2) and then submodel $\hat{\mathcal{A}}_{k\lambda}$ is selected, $k = 1, 2$.

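If $\lambda$ were from Case 4, both submodels, $\hat{\mathcal{A}}_{k\lambda}$, $k = 1, 2$, would include non-informative variables randomly. The agreement of these two submodels can be measured by Cohen's Kappa coefficient (Cohen, 1960),

$$\kappa(\hat{\mathcal{A}}_{1\lambda}, \hat{\mathcal{A}}_{2\lambda}) = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}, \qquad (3)$$

where $\Pr(a) = (|\hat{\mathcal{A}}_{1\lambda} \cap \hat{\mathcal{A}}_{2\lambda}| + |\hat{\mathcal{A}}_{1\lambda}^c \cap \hat{\mathcal{A}}_{2\lambda}^c|)/p$ and $\Pr(e) = (|\hat{\mathcal{A}}_{1\lambda}||\hat{\mathcal{A}}_{2\lambda}| + |\hat{\mathcal{A}}_{1\lambda}^c||\hat{\mathcal{A}}_{2\lambda}^c|)/p^2$.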
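As a small worked illustration of the Kappa coefficient in (3), the sketch below computes it for two selected index sets. The function name `cohen_kappa` and the 0-based indexing are assumptions made for this example, and the value of -1 for the two degenerate configurations follows the convention adopted in this section.

```python
def cohen_kappa(set1, set2, p):
    """Cohen's Kappa between two selected index sets out of p candidate variables."""
    s1, s2 = set(set1), set(set2)
    # Degenerate configurations (both empty or both full) are assigned -1.
    if (len(s1) == 0 and len(s2) == 0) or (len(s1) == p and len(s2) == p):
        return -1.0
    both_in = len(s1 & s2)        # variables selected by both submodels
    both_out = p - len(s1 | s2)   # variables excluded by both submodels
    pr_a = (both_in + both_out) / p
    pr_e = (len(s1) * len(s2) + (p - len(s1)) * (p - len(s2))) / p ** 2
    return (pr_a - pr_e) / (1.0 - pr_e)

# Example with p = 8: the submodels agree on 7 of 8 variables.
# Pr(a) = 7/8, Pr(e) = (3*4 + 5*4)/64 = 0.5, so Kappa = 0.375/0.5 = 0.75.
print(cohen_kappa({0, 1, 4}, {0, 1, 4, 6}, p=8))
```

Two identical, non-degenerate submodels give a Kappa of exactly 1, which is the upper bound exploited in Section 4.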

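On the other hand, if $\lambda$ were from Case 2, either submodel, $\hat{\mathcal{A}}_{k\lambda}$, $k = 1, 2$, might exclude some informative variable. To avoid such under-fitting, consider cross-validation,

$$CV(Z_1, Z_2; \lambda) = \Big\{\sum_{i=1}^{m} (y_i^* - x_i^{*\prime}\hat\beta_{2\lambda})^2 + \sum_{i=m+1}^{n} (y_i^* - x_i^{*\prime}\hat\beta_{1\lambda})^2\Big\}/n. \qquad (4)$$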

In addition, the submodel $\mathcal{A}$ is assumed to be sparse and to contain at least one variable, so $\kappa(\hat{\mathcal{A}}_{1\lambda}, \hat{\mathcal{A}}_{2\lambda})$ is set to $-1$ if both $\hat{\mathcal{A}}_{1\lambda}$ and $\hat{\mathcal{A}}_{2\lambda}$ are empty or both are full (that is, the two degenerate cases, Cases 1 and 5, are pre-excluded).

Now we are ready to describe the PASS algorithm, which runs the following five steps.
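Step 1: Randomly partition the original dataset into two halves, $Z_1^{*b}$ and $Z_2^{*b}$.

Step 2: Based on $Z_1^{*b}$ and $Z_2^{*b}$ respectively, two submodels, $\hat{\mathcal{A}}_{1\lambda}^{*b}$ and $\hat{\mathcal{A}}_{2\lambda}^{*b}$, are selected.

Step 3: Calculate $\kappa(\hat{\mathcal{A}}_{1\lambda}^{*b}, \hat{\mathcal{A}}_{2\lambda}^{*b})$ and $CV(Z_1^{*b}, Z_2^{*b}; \lambda)$.

Step 4: Repeat Steps 1-3 $B$ times and obtain the following ratio,

$$PASS(\lambda) = \sum_{b=1}^{B} \kappa(\hat{\mathcal{A}}_{1\lambda}^{*b}, \hat{\mathcal{A}}_{2\lambda}^{*b}) \Big/ \sum_{b=1}^{B} CV(Z_1^{*b}, Z_2^{*b}; \lambda). \qquad (5)$$

Step 5: Compute $PASS(\lambda)$ on a grid of $\lambda$ and select $\hat\lambda = \arg\max_{\lambda} PASS(\lambda)$.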
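The five steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation in the R package pass: the LASSO is fitted by a plain coordinate descent written for the objective in (2), the half-samples are not re-centered, and the names `lasso_cd`, `kappa`, `pass_criterion`, and `select_lambda` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum((y - X b)^2)/n + lam * sum(|b_j|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ms = (X ** 2).sum(axis=0) / n   # mean square of each column
    resid = y.copy()                    # equals y - X beta, with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]  # partial residual without variable j
            rho = X[:, j] @ resid / n
            # With the 1/n loss, the soft-threshold level is lam / 2.
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ms[j]
            resid -= X[:, j] * beta[j]
    return beta

def kappa(b1, b2):
    """Cohen's Kappa between the supports of b1 and b2; -1 if degenerate."""
    p = len(b1)
    s1, s2 = b1 != 0, b2 != 0
    if (not s1.any() and not s2.any()) or (s1.all() and s2.all()):
        return -1.0
    pr_a = np.mean(s1 == s2)
    pr_e = (s1.sum() * s2.sum() + (~s1).sum() * (~s2).sum()) / p ** 2
    return (pr_a - pr_e) / (1.0 - pr_e)

def pass_criterion(X, y, lam, B=20, rng=None):
    """Steps 1-4: ratio of summed Kappa to summed two-fold CV error."""
    rng = np.random.default_rng(rng)
    n = len(y)
    m = n // 2
    kappa_sum, cv_sum = 0.0, 0.0
    for _ in range(B):
        idx = rng.permutation(n)                      # Step 1: random halves
        i1, i2 = idx[:m], idx[m:]
        b1 = lasso_cd(X[i1], y[i1], lam)              # Step 2: fit on each half
        b2 = lasso_cd(X[i2], y[i2], lam)
        kappa_sum += kappa(b1, b2)                    # Step 3: agreement ...
        cv_sum += (np.sum((y[i1] - X[i1] @ b2) ** 2)  # ... and prediction error
                   + np.sum((y[i2] - X[i2] @ b1) ** 2)) / n
    return kappa_sum / cv_sum

def select_lambda(X, y, grid, B=20, seed=0):
    """Step 5: maximize PASS over a grid, reusing the same partitions for each lam."""
    scores = [pass_criterion(X, y, lam, B=B, rng=seed) for lam in grid]
    return grid[int(np.argmax(scores))], scores
```

Passing the same seed for every grid value means all candidate values of the tuning parameter are compared on identical partitions, a design choice of this sketch that reduces Monte Carlo noise in the comparison.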

4 Selection consistency

Recall the existence of $r_n$ and $s_n$ in Section 2, which play an important role here. The underlying assumptions are not stated in the following proposition, but they can be found in Bach (2008) for LASSO, Fan and Li (2001) for SCAD, and Zou (2006) for adaptive LASSO. As discussed in Section 3, Cases 1 and 5 can be pre-excluded by the definition of $\kappa$, so it suffices to show that PASS can distinguish Case 3 from Cases 2 and 4.

Proposition 1 For any $\lambda_n$ such that $r_n \prec \lambda_n \prec s_n$, as $n \to \infty$ and $B \to \infty$,

$$\Pr\{PASS(s_n) < PASS(\lambda_n)\} \to 1 \quad \text{and} \quad \Pr\{PASS(r_n) < PASS(\lambda_n)\} \to 1.$$

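Heuristic proof: First, by Chebyshev's inequality, for identically distributed variables $X_{nb}$, $b = 1, \dots, B$, if $\mathrm{Var}(X_{nb}) \le C$ and $\mathrm{Corr}(X_{n1}, X_{n2}) \to 0$, then $\sum_{b=1}^{B} X_{nb}/B - E(X_{n1}) \to 0$.

If $r_n \prec \lambda_n \prec s_n$, by the result in Case 3, $\hat{\mathcal{A}}_{1\lambda_n} = \hat{\mathcal{A}}_{2\lambda_n} = \mathcal{A}$ with probability tending to one, and therefore by Lebesgue's dominated convergence theorem, $E\{\kappa(\hat{\mathcal{A}}_{1\lambda_n}^{*b}, \hat{\mathcal{A}}_{2\lambda_n}^{*b})\} \to 1$. In order to apply Lebesgue's dominated convergence theorem to examine the asymptotic property of cross-validation, assume that in (4), $\hat\beta_{1\lambda}$ and $\hat\beta_{2\lambda}$ are bounded manually by some large value, say $M = 10^6$. Then, by Lebesgue's dominated convergence theorem, $E\{CV(Z_1^{*b}, Z_2^{*b}; \lambda_n)\} \to \sigma^2$.

When $\lambda = s_n$, $E\{CV(Z_1^{*b}, Z_2^{*b}; s_n)\} \to \sigma^2 + c\|\beta - \gamma_0\|^2$, where $\gamma_0$ is defined in Case 2 and $c$ is the limit of the minimum eigenvalue of $\sum_i x_i x_i'/n$ (fixed design matrix) or the minimum eigenvalue of $E(x_1 x_1')$ (random design matrix). And trivially, $\kappa(\hat{\mathcal{A}}_{1s_n}^{*b}, \hat{\mathcal{A}}_{2s_n}^{*b}) \le 1$. Therefore, $\Pr\{PASS(s_n) < PASS(\lambda_n)\} \to 1$.

When $\lambda = r_n$, by the result in Case 4, $\Pr(\hat{\mathcal{A}}_{1r_n}^{*b} \neq \hat{\mathcal{A}}_{2r_n}^{*b}) \to \delta > 0$. Note that if $\hat{\mathcal{A}}_{1r_n}^{*b} \neq \hat{\mathcal{A}}_{2r_n}^{*b}$, then $\kappa(\hat{\mathcal{A}}_{1r_n}^{*b}, \hat{\mathcal{A}}_{2r_n}^{*b}) \le 1 - 1/p$. Then $\lim_{n \to \infty} E\{\kappa(\hat{\mathcal{A}}_{1r_n}^{*b}, \hat{\mathcal{A}}_{2r_n}^{*b})\} \le (1 - \delta) + (1 - 1/p)\delta < 1$. And trivially, $E\{CV(Z_1^{*b}, Z_2^{*b}; r_n)\} \to \sigma^2$. Therefore, $\Pr\{PASS(r_n) < PASS(\lambda_n)\} \to 1$. □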
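The bound on Kappa for two different submodels, used in the last step of the proof, can be checked directly from the definition of Cohen's Kappa. The short derivation below is a sketch added for the reader and uses only the definitions of $\Pr(a)$ and $\Pr(e)$; it assumes the non-degenerate case, in which $0 \le \Pr(e) < 1$.

```latex
\kappa
  = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}
  = 1 - \frac{1 - \Pr(a)}{1 - \Pr(e)}
  \le 1 - \{1 - \Pr(a)\}
  = \Pr(a)
  \le 1 - \frac{1}{p}.
```

The first inequality holds because $0 < 1 - \Pr(e) \le 1$ and $1 - \Pr(a) \ge 0$; the second holds because two different submodels disagree on at least one of the $p$ variables, so $\Pr(a) \le (p-1)/p$. In the degenerate case Kappa is set to $-1$, which also satisfies the bound.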

5 Numerical results

In this section, via simulations, the PASS method is compared with $C_p$, 10-fold cross-validation (CV), generalized cross-validation (GCV), and BIC. AIC is not included because it is equivalent to $C_p$ here. The R package pass was created to implement both the PASS method proposed here and the Kappa selection method proposed in Sun et al. (2012). After $\lambda$ is selected by one of the above criteria, submodel $\hat{\mathcal{A}}_\lambda$ is selected based on the non-zero components of $\hat\beta_\lambda$ obtained from (2). In addition, the OLS estimate based on only the selected variables, $\hat\beta^{\mathrm{OLS}}$, is also obtained, along with its relative prediction error, $RPE = E(x_0'\hat\beta^{\mathrm{OLS}} - x_0'\beta)^2/\sigma^2$, where $x_0$ is i.i.d. with $x_i$.
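For a fixed estimate and a centered covariate vector with covariance matrix $\Sigma$, the expectation in the relative prediction error reduces to a quadratic form in the estimation error divided by $\sigma^2$. The sketch below uses that identity; the helper names `refit_ols` and `relative_prediction_error` are hypothetical, and the covariance matrix is assumed to be known, as it is in a simulation.

```python
import numpy as np

def refit_ols(X, y, selected):
    """OLS refit on the selected columns; coefficients of the others stay zero."""
    beta = np.zeros(X.shape[1])
    sel = np.asarray(selected, dtype=int)
    if sel.size > 0:
        beta[sel] = np.linalg.lstsq(X[:, sel], y, rcond=None)[0]
    return beta

def relative_prediction_error(beta_hat, beta_true, Sigma, sigma2=1.0):
    """RPE = (beta_hat - beta)' Sigma (beta_hat - beta) / sigma^2."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta_true, dtype=float)
    return float(d @ Sigma @ d) / sigma2
```

In a simulation the quadratic form avoids drawing a separate test sample, since the expectation over the new covariate vector is available in closed form.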

Three scenarios are considered. In Scenario I, the data generating model is a subset of the full model. In Scenario II, tapering effects are added to the generating model in Scenario I. In Scenario III, the dimension of the data increases with the sample size.

In Scenario I, the data generating model is (1) where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)'$, and $x_{i1}, \dots, x_{ip}$ and $\varepsilon_i$ are generated from $N(0, 1)$ with $\mathrm{Corr}(x_{ij}, x_{ik}) = 0.5^{|j-k|}$. This example has been commonly used in the literature, for example in Tibshirani (1996), Fan and Li (2001), and Zou (2006). The sample size $n$ is set to 40, 60, and 80. Three regularized procedures are applied: LASSO, adaptive LASSO (aLASSO), and SCAD. The tuning parameter $\lambda$ is searched over the grid $\{10^{-2+4k/99}; k = 0, \dots, 99\}$. The number of random partitions is set to $B = 20$.
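A data set from Scenario I can be generated along the following lines. This is a hedged sketch: the function name `simulate_scenario_one` is hypothetical, the centering step follows the assumption made in Section 2, and the grid is assumed to be 100 log-spaced values between $10^{-2}$ and $10^{2}$.

```python
import numpy as np

def simulate_scenario_one(n, rho=0.5, sigma=1.0, seed=None):
    """Draw (X, y) from the Scenario I model and center both."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta.size
    idx = np.arange(p)
    # Covariance with entries rho^|j - k| and unit variances.
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    # Center response and covariates so that no intercept is needed.
    return X - X.mean(axis=0), y - y.mean(), beta

# Assumed grid: 100 values, log-spaced from 10^-2 to 10^2.
lambda_grid = 10.0 ** (-2.0 + 4.0 * np.arange(100) / 99.0)
```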

Each simulation setting is repeated 100 times. The percentage of selecting the sparse generating model $\mathcal{A} = \{1, 2, 5\}$ and the relative prediction error (RPE) of the selected submodel are summarized in Table 1. The average numbers of correctly selected zeros (C) and incorrectly selected zeros (I) are summarized in Table 2.



Table 1: Percentage (PCT) of selecting {1,2,5} and average RPE of selected submodels

              PASS          BIC           Cp            CV            GCV
n   Method    PCT   RPE     PCT   RPE     PCT   RPE     PCT   RPE     PCT   RPE
40  LASSO     0.45  0.142   0.29  0.183   0.16  0.203   0.09  0.220   0.16  0.203
    aLASSO    0.94  0.102   0.75  0.143   0.53  0.181   0.63  0.167   0.52  0.181
    SCAD      0.99  0.092   0.81  0.141   0.55  0.180   0.76  0.152   0.52  0.184
60  LASSO     0.49  0.095   0.35  0.112   0.16  0.138   0.14  0.140   0.17  0.137
    aLASSO    0.99  0.069   0.87  0.084   0.52  0.118   0.65  0.103   0.52  0.118
    SCAD      1.00  0.066   0.88  0.084   0.58  0.118   0.76  0.100   0.56  0.119
80  LASSO     0.60  0.055   0.38  0.074   0.16  0.097   0.08  0.097   0.16  0.098
    aLASSO    0.99  0.042   0.88  0.056   0.56  0.081   0.77  0.067   0.56  0.081
    SCAD      0.99  0.044   0.89  0.056   0.62  0.079   0.75  0.069   0.61  0.080



Table 2: Average numbers of correctly selected zeros (C) and incorrectly selected zeros (I)

              PASS       BIC        Cp         CV         GCV
n   Method    C     I    C     I    C     I    C     I    C     I
40  LASSO     4.16  0    3.68  0    3.26  0    2.66  0    3.25  0
    aLASSO    4.94  0    4.59  0    4.16  0    4.25  0    4.15  0
    SCAD      4.99  0    4.63  0    4.11  0    4.39  0    4.06  0
60  LASSO     4.36  0    4.00  0    3.12  0    2.85  0    3.13  0
    aLASSO    4.99  0    4.84  0    4.17  0    4.35  0    4.17  0
    SCAD      5.00  0    4.84  0    4.15  0    4.37  0    4.12  0
80  LASSO     4.47  0    4.05  0    3.01  0    2.66  0    3.00  0
    aLASSO    4.99  0    4.84  0    4.19  0    4.49  0    4.19  0
    SCAD      4.99  0    4.83  0    4.23  0    4.45  0    4.22  0



Table 1 shows that PASS performs much better than the other criteria in terms of having the largest percentage of selecting submodel $\mathcal{A} = \{1, 2, 5\}$. In addition, if the selected model is used for prediction (although in practice a single selected model is too simplistic for prediction), PASS performs better than the others in terms of having the smallest RPE. The table also confirms that, in terms of variable selection, adaptive LASSO and SCAD perform better than LASSO, and BIC performs better than $C_p$, CV, and GCV. Furthermore, Table 2 shows that none of the criteria selected any incorrect zeros in the 100 replications here. PASS also performs much better than the others in terms of selecting the largest number of correct zeros; there are 5 correct zeros in the data generating model.

In Scenario II, the consequence of adding tapering effects is examined. Three generating models are considered: (II.1) $\beta = (3, 2, 1.5, 0.05, 0.04, 0.03, 0.02, 0.01)'$; (II.2) $\beta = (3, 2, 1.5, 0.1, 0.08, 0.06, 0.04, 0.02)'$; and (II.3) $\beta = (3, 2, 1.5, 0.2, 0.16, 0.12, 0.08, 0.04)'$. Other setups are the same as those in Scenario I except that the sample size $n$ is set to 40. Table 3 summarizes the average size and the average RPE of the selected submodels.



Table 3: Average size and average RPE of selected submodels

                PASS          BIC           Cp            CV            GCV
Model  Method   Size  RPE     Size  RPE     Size  RPE     Size  RPE     Size  RPE
II.1   LASSO    3.58  0.145   4.08  0.179   4.83  0.208   5.15  0.212   4.79  0.207
       aLASSO   3.08  0.133   3.33  0.149   3.88  0.187   3.83  0.174   3.91  0.189
       SCAD     3.08  0.122   3.27  0.143   3.89  0.191   3.71  0.169   3.98  0.197
II.2   LASSO    3.80  0.166   4.58  0.194   5.06  0.208   5.52  0.211   5.06  0.209
       aLASSO   3.17  0.172   3.63  0.184   4.22  0.208   4.10  0.200   4.21  0.208
       SCAD     3.20  0.158   3.51  0.178   4.16  0.206   3.89  0.191   4.19  0.209
II.3   LASSO    4.54  0.215   5.28  0.211   5.75  0.222   6.25  0.224   5.69  0.220
       aLASSO   3.36  0.260   4.17  0.249   4.71  0.230   4.67  0.246   4.72  0.230
       SCAD     3.59  0.252   4.16  0.253   4.74  0.234   4.69  0.249   4.75  0.235



Table 3 shows that PASS is more immune to tapering effects than the other criteria. In model (II.1), the signal-to-noise ratio (SNR) of the largest tapering effect is $0.05\sqrt{40} = 0.316$, and therefore it is desirable to exclude all 5 tapering effects, and PASS outperforms the others. In model (II.2), the SNR of the largest tapering effect is $0.1\sqrt{40} = 0.632$, so it is still reasonable to exclude all 5 tapering effects, and PASS again outperforms the others. However, in model (II.3), the SNR of the largest tapering effect is $0.2\sqrt{40} = 1.265$, and therefore it is debatable whether all 5 tapering effects should be excluded. Still, PASS selects sparser submodels than the others, but in some cases PASS has a slightly larger RPE than the others.
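The signal-to-noise ratios quoted for the largest tapering effects are consistent with the coefficient multiplied by the square root of the sample size and divided by the error standard deviation, which equals one here. This reading is an assumption inferred from the reported values, and the function name `snr` is hypothetical.

```python
import math

def snr(beta_j, n, sigma=1.0):
    """Assumed SNR of a single effect: beta_j * sqrt(n) / sigma."""
    return beta_j * math.sqrt(n) / sigma

# Largest tapering effect in models (II.1), (II.2), and (II.3), with n = 40.
for beta_j in (0.05, 0.1, 0.2):
    print(beta_j, round(snr(beta_j, 40), 3))
```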

In Scenario III, we investigate the effects of the dimensionality. The setting is similar to the one in Scenario I except that $\beta = (5, 4, 3, 2, 1, 0, \dots, 0)^T$ and $p = \lfloor \sqrt{n} \rfloor$. More specifically, 3 cases are examined: (1) $n = 100$, $p = 10$; (2) $n = 200$, $p = 14$; and (3) $n = 400$, $p = 20$. The percentage of selecting the sparse generating model $\mathcal{A} = \{1, 2, 3, 4, 5\}$ and the relative prediction error (RPE) of the selected submodel are summarized in Table 4.



Table 4: Percentage (PCT) of selecting {1, 2, 3, 4, 5} and average RPE of selected submodels

                   PASS          BIC           Cp            CV            GCV
n(p)      Method   PCT   RPE     PCT   RPE     PCT   RPE     PCT   RPE     PCT   RPE
100(10)   LASSO    0.74  0.055   0.43  0.083   0.17  0.084   0.10  0.082   0.17  0.084
          aLASSO   0.96  0.049   0.86  0.053   0.48  0.061   0.74  0.056   0.47  0.063
          SCAD     0.97  0.048   0.92  0.049   0.47  0.073   0.82  0.050   0.47  0.072
200(14)   LASSO    0.89  0.022   0.49  0.040   0.11  0.043   0.07  0.045   0.11  0.043
          aLASSO   0.99  0.018   0.90  0.024   0.38  0.037   0.66  0.027   0.38  0.037
          SCAD     1.00  0.018   0.93  0.022   0.46  0.038   0.73  0.024   0.47  0.038
400(20)   LASSO    0.95  0.012   0.53  0.029   0.09  0.025   0.04  0.023   0.09  0.025
          aLASSO   1.00  0.012   0.93  0.013   0.34  0.019   0.73  0.015   0.33  0.019
          SCAD     1.00  0.012   0.98  0.012   0.43  0.020   0.75  0.012   0.43  0.021



Clearly the proposed PASS criterion outperforms the other competitors in both variable selection and prediction performance. As illustrated in Table 4, PASS delivers the largest percentage of selecting the true active set among all the selection criteria, and yields the smallest relative prediction error, or ties for the smallest, in all cases.

6 Discussion

In the literature, BIC is commonly used for tuning parameter selection in regularized procedures. Recently, stability selection has been gaining popularity. The intuition behind stability selection is that a good variable selection criterion should select similar subsets of variables when applied to different samples of data generated from the same population. However, if there were a few variables with very large effects, then any selection criterion selecting only these "big" variables would be stable, and therefore applying stability selection would lead to under-fitting. The PASS criterion proposed here overcomes this drawback by borrowing strength from cross-validation.

Although the PASS criterion has been shown to be selection consistent, it is worth noting that selection consistency is meaningful only in theory because a naively simple true model is assumed. In practice, it is extremely important to evaluate carefully the scientific aspects of the full model before conducting variable selection. Practically, PASS, along with many other criteria, can only be treated as a tool for data mining or data dredging. In other words, these variable selection criteria are exploratory rather than confirmatory.

Another limitation of the proposed criterion, although only a technical one, is that the selected $\lambda$ corresponds to sample size $n/2$, because the data are partitioned into two halves each time. This limitation is common to any stability selection method (e.g., Meinshausen and Bühlmann, 2010): because there is only one dataset, data re-generation has to be mimicked by some form of data re-sampling in order to assess stability.

Finally, stability selection is becoming popular for cluster analysis (e.g., Fang and Wang, 
2012), an example of unsupervised learning. There is no doubt that in any unsupervised 
learning, the problem of tuning parameter selection is very difficult, because there is no 
loss function to guide the selection. Maybe stability selection can be used to select tuning 
parameters in regularized procedures proposed for unsupervised learning. 